59 research outputs found

    Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks

    Get PDF
    Deep convolutional neural networks (CNNs) are widely used in modern AI systems for their superior accuracy but at the cost of high computational complexity. The complexity comes from the need to simultaneously process hundreds of filters and channels in the high-dimensional convolutions, which involve a significant amount of data movement. Although highly-parallel compute paradigms, such as SIMD/SIMT, effectively address the computation requirement to achieve high throughput, energy consumption still remains high as data movement can be more expensive than computation. Accordingly, finding a dataflow that supports parallel processing with minimal data movement cost is crucial to achieving energy-efficient CNN processing without compromising accuracy. In this paper, we present a novel dataflow, called row-stationary (RS), that minimizes data movement energy consumption on a spatial architecture. This is realized by exploiting local data reuse of filter weights and feature map pixels, i.e., activations, in the high-dimensional convolutions, and minimizing data movement of partial sum accumulations. Unlike dataflows used in existing designs, which only reduce certain types of data movement, the proposed RS dataflow can adapt to different CNN shape configurations and reduces all types of data movement through maximally utilizing the processing engine (PE) local storage, direct inter-PE communication and spatial parallelism. To evaluate the energy efficiency of the different dataflows, we propose an analysis framework that compares energy cost under the same hardware area and processing parallelism constraints. Experiments using the CNN configurations of AlexNet show that the proposed RS dataflow is more energy efficient than existing dataflows in both convolutional (1.4x to 2.5x) and fully-connected layers (at least 1.3x for batch size larger than 16). The RS dataflow has also been demonstrated on a fabricated chip, which verifies our energy analysis

    Penetrating Shields: A Systematic Analysis of Memory Corruption Mitigations in the Spectre Era

    Full text link
    This paper provides the first systematic analysis of a synergistic threat model encompassing memory corruption vulnerabilities and microarchitectural side-channel vulnerabilities. We study speculative shield bypass attacks that leverage speculative execution attacks to leak secrets that are critical to the security of memory corruption mitigations (i.e., the shields), and then use the leaked secrets to bypass the mitigation mechanisms and successfully conduct memory corruption exploits, such as control-flow hijacking. We start by systematizing a taxonomy of the state-of-the-art memory corruption mitigations focusing on hardware-software co-design solutions. The taxonomy helps us to identify 10 likely vulnerable defense schemes out of 20 schemes that we analyze. Next, we develop a graph-based model to analyze the 10 likely vulnerable defenses and reason about possible countermeasures. Finally, we present three proof-of-concept attacks targeting an already-deployed mitigation mechanism and two state-of-the-art academic proposals.Comment: 14 page

    Towards Closing the Energy Gap Between HOG and CNN Features for Embedded Vision

    Get PDF
    Computer vision enables a wide range of applications in robotics/drones, self-driving cars, smart Internet of Things, and portable/wearable electronics. For many of these applications, local embedded processing is preferred due to privacy and/or latency concerns. Accordingly, energy-efficient embedded vision hardware delivering real-time and robust performance is crucial. While deep learning is gaining popularity in several computer vision algorithms, a significant energy consumption difference exists compared to traditional hand-crafted approaches. In this paper, we provide an in-depth analysis of the computation, energy and accuracy trade-offs between learned features such as deep Convolutional Neural Networks (CNN) and hand-crafted features such as Histogram of Oriented Gradients (HOG). This analysis is supported by measurements from two chips that implement these algorithms. Our goal is to understand the source of the energy discrepancy between the two approaches and to provide insight about the potential areas where CNNs can be improved and eventually approach the energy-efficiency of HOG while maintaining its outstanding performance accuracy

    Hardware for Machine Learning: Challenges and Opportunities

    Get PDF
    Machine learning plays a critical role in extracting meaningful information out of the zetabytes of sensor data collected every day. For some applications, the goal is to analyze and understand the data to identify trends (e.g., surveillance, portable/wearable electronics); in other applications, the goal is to take immediate action based the data (e.g., robotics/drones, self-driving cars, smart Internet of Things). For many of these applications, local embedded processing near the sensor is preferred over the cloud due to privacy or latency concerns, or limitations in the communication bandwidth. However, at the sensor there are often stringent constraints on energy consumption and cost in addition to throughput and accuracy requirements. Furthermore, flexibility is often required such that the processing can be adapted for different applications or environments (e.g., update the weights and model in the classifier). In many applications, machine learning often involves transforming the input data into a higher dimensional space, which, along with programmable weights, increases data movement and consequently energy consumption. In this paper, we will discuss how these challenges can be addressed at various levels of hardware design ranging from architecture, hardware-friendly algorithms, mixed-signal circuits, and advanced technologies (including memories and sensors).United States. Defense Advanced Research Projects Agency (DARPA)Texas Instruments IncorporatedIntel Corporatio

    Memory dependence prediction using store sets

    Full text link

    Penerapan Model Pembelajaran Problem Based Learning (Pbl) untuk Meningkatkan Hasil Belajar Siswa pada Pokok Bahasan Kelarutan dan Hasil Kali Kelarutan di Kelas XI IPA SMA Negeri 1 Kampar

    Get PDF
    Research was aimed to improve study result of student on solubility and result times solubility subject has been doing in class XI science SMAN 1 Kampar. This research was a form of experiments research with design pretest-posttest. Sample of the research were student of class XI science 2 as experimental class and class XI science 3 as control class. The experimental class was applied learning model problem base learning while the control class using discussion method. Data were analized using t- test. Result from the data analysis showed t count > t table (1,6923 > 1,68). It means learning model problem based learning can increase study result of student on solubility and result times solubility subject in class XI science SMAN 1 Kampar with category of increase study results on the solubility and result times solubility in class XI science is high category

    TeAAL: A Declarative Framework for Modeling Sparse Tensor Accelerators

    Full text link
    Over the past few years, the explosion in sparse tensor algebra workloads has led to a corresponding rise in domain-specific accelerators to service them. Due to the irregularity present in sparse tensors, these accelerators employ a wide variety of novel solutions to achieve good performance. At the same time, prior work on design-flexible sparse accelerator modeling does not express this full range of design features, making it difficult to understand the impact of each design choice and compare or extend the state-of-the-art. To address this, we propose TeAAL: a language and compiler for the concise and precise specification and evaluation of sparse tensor algebra architectures. We use TeAAL to represent and evaluate four disparate state-of-the-art accelerators--ExTensor, Gamma, OuterSPACE, and SIGMA--and verify that it reproduces their performance with high accuracy. Finally, we demonstrate the potential of TeAAL as a tool for designing new accelerators by showing how it can be used to speed up Graphicionado--by 38Ă—38\times on BFS and 4.3Ă—4.3\times on SSSP.Comment: 14 pages, 12 figure

    CAMP: A technique to estimate per-structure power at run-time using a few simple parameters

    Full text link
    Microprocessor power has become a first-order constraint at run-time. Designers must employ aggressive power-management techniques at run-time to keep a processor’s ballooning power requirements under control. Effective power management benefits from knowledge of run-time microprocessor power consumption in both the core and individual microarchitectural structures, such as caches, queues, and execution units. Increasingly feasible per-structure power-control techniques, such as fine-grain clock gat-ing, power gating, and dynamic voltage/frequency scaling (DVFS), become more effective from run-time estimates of per-structure power. However, run-time computation of per-structure power esti-mates based on utilization requires daunting numbers of input sta-tistics, which makes per-structure monitoring of run-time power a challenging problem

    A comparative study of arbitration algorithms for the Alpha 21364 pipelined router

    Full text link
    Interconnection networks usually consist of a fabric of interconnected routers, which receive packets arriving at their input ports and forward them to appropriate output ports. Unfortunately, network packets moving through these routers are often delayed due to conflicting demand for resources, such as output ports or buffer space. Hence, routers typically employ arbiters that resolve conflicting resource demands to maximize the number of matches between packets waiting at input ports and free output ports. Efficient design and implementation of the algorithm running on these arbiters is critical to maximize network performance.This paper proposes a new arbitration algorithm called SPAA (Simple Pipelined Arbitration Algorithm), which is implemented in the Alpha 21364 processor's on-chip router pipeline. Simulation results show that SPAA significantly outperforms two earlier well-known arbitration algorithms: PIM (Parallel Iterative Matching) and WFA (Wave-Front Arbiter) implemented in the SGI Spider switch. SPAA outperforms PIM and WFA because SPAA exhibits matching capabilities similar to PIM and WFA under realistic conditions when many output ports are busy, incurs fewer clock cycles to perform the arbitration, and can be pipelined effectively. Additionally, we propose a new prioritization policy called the Rotary Rule, which prevents the network's adverse performance degradation from saturation at high network loads by prioritizing packets already in the network over new packets generated by caches or memory.Mukherjee, S.; Silla Jiménez, F.; Bannon, P.; Emer, J.; Lang, S.; Webb, D. (2002). A comparative study of arbitration algorithms for the Alpha 21364 pipelined router. ACM SIGPLAN Notices. 37(10):223-234. doi:10.1145/605432.605421S223234371
    • …
    corecore